NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Methylation Data Analysis and Interpretation

https://doi.org/10.1146/annurev-biodatasci-120924-091033

Zhu, Yuehua; Mao, Weiguang; Hosseini, Rezwan; Chikina, Maria (August 2025, Annual Review of Biomedical Data Science)

DNA methylation, a covalent modification, fundamentally shapes mammalian gene regulation and cellular identity. This review examines methylation's biochemical underpinnings, genomic distribution patterns, and analytical approaches. We highlight three distinctive aspects that separate methylation from other epigenetic marks: its remarkable stability as a silencing mechanism, its capacity to maintain distinct states independently of DNA sequence, and its effectiveness as a quantitative trait linking genotype to disease risk. We also explore the phenomenon of methylation clocks and their biological significance. The review addresses technical considerations across major assay types—both array-based technologies and sequencing approaches—with emphasis on data normalization, quality control, cell proportion inference, and the specialized statistical models required for next-generation sequencing analysis.
more » « less
Free, publicly-accessible full text available August 11, 2026
A unified hypothesis-free feature extraction framework for diverse epigenomic data

https://doi.org/10.1093/bioadv/vbaf013

Balcı, Ali_Tuğrul; Chikina, Maria; Mahony, ed., Shaun (March 2025, Bioinformatics Advances)

Abstract MotivationEpigenetic assays using next-generation sequencing have furthered our understanding of the functional genomic regions and the mechanisms of gene regulation. However, a single assay produces billions of data points, with limited information about the biological process due to numerous sources of technical and biological noise. To draw biological conclusions, numerous specialized algorithms have been proposed to summarize the data into higher-order patterns, such as peak calling and the discovery of differentially methylated regions. The key principle underlying these approaches is the search for locally consistent patterns. ResultsWe propose L0 segmentation as a universal framework for extracting locally coherent signals for diverse epigenetic sources. L0 serves to compress the input signal by approximating it as a piecewise constant. We implement a highly scalable L0 segmentation with additional loss functions designed for sequencing epigenetic data types including Poisson loss for single tracks and binomial loss for methylation/coverage data. We show that the L0 segmentation approach retains the salient features of the data yet can identify subtle features, such as transcription end sites, missed by other analytic approaches. Availability and implementationOur approach is implemented as an R package “l01segmentation” with a C++ backend. Available at https://github.com/boooooogey/l01segmentation.
more » « less
Heterogeneous pseudobulk simulation enables realistic benchmarking of cell-type deconvolution methods

https://doi.org/10.1186/s13059-024-03292-w

Hu, Mengying; Chikina, Maria (December 2024, Genome Biology)

Abstract BackgroundComputational cell type deconvolution enables the estimation of cell type abundance from bulk tissues and is important for understanding tissue microenviroment, especially in tumor tissues. With rapid development of deconvolution methods, many benchmarking studies have been published aiming for a comprehensive evaluation for these methods. Benchmarking studies rely on cell-type resolved single-cell RNA-seq data to create simulated pseudobulk datasets by adding individual cells-types in controlled proportions. ResultsIn our work, we show that the standard application of this approach, which uses randomly selected single cells, regardless of the intrinsic difference between them, generates synthetic bulk expression values that lack appropriate biological variance. We demonstrate why and how the current bulk simulation pipeline with random cells is unrealistic and propose a heterogeneous simulation strategy as a solution. The heterogeneously simulated bulk samples match up with the variance observed in real bulk datasets and therefore provide concrete benefits for benchmarking in several ways. We demonstrate that conceptual classes of deconvolution methods differ dramatically in their robustness to heterogeneity with reference-free methods performing particularly poorly. For regression-based methods, the heterogeneous simulation provides an explicit framework to disentangle the contributions of reference construction and regression methods to performance. Finally, we perform an extensive benchmark of diverse methods across eight different datasets and find BayesPrism and a hybrid MuSiC/CIBERSORTx approach to be the top performers. ConclusionsOur heterogeneous bulk simulation method and the entire benchmarking framework is implemented in a user friendly packagehttps://github.com/humengying0907/deconvBenchmarkingandhttps://doi.org/10.5281/zenodo.8206516, enabling further developments in deconvolution methods.
more » « less
Full Text Available
Best holdout assessment is sufficient for cancer transcriptomic model selection

https://doi.org/10.1016/j.patter.2024.101115

Crawford, Jake; Chikina, Maria; Greene, Casey S (December 2024, Patterns)

Full Text Available
A hybrid constrained continuous optimization approach for optimal causal discovery from biological data

https://doi.org/10.1093/bioinformatics/btae411

Zhu, Yuehua; Benos, Panayiotis_V; Chikina, Maria (September 2024, Bioinformatics)

Abstract MotivationUnderstanding causal effects is a fundamental goal of science and underpins our ability to make accurate predictions in unseen settings and conditions. While direct experimentation is the gold standard for measuring and validating causal effects, the field of causal graph theory offers a tantalizing alternative: extracting causal insights from observational data. Theoretical analysis has shown that this is indeed possible, given a large dataset and if certain conditions are met. However, biological datasets, frequently, do not meet such requirements but evaluation of causal discovery algorithms is typically performed on synthetic datasets, which they meet all requirements. Thus, real-life datasets are needed, in which the causal truth is reasonably known. In this work we first construct such a large-scale real-life dataset and then we perform on it a comprehensive benchmarking of various causal discovery methods. ResultsWe find that the PC algorithm is particularly accurate at estimating causal structure, including the causal direction which is critical for biological applicability. However, PC does only produces cause-effect directionality, but not estimates of causal effects. We propose PC-NOTEARS (PCnt), a hybrid solution, which includes the PC output as an additional constraint inside the NOTEARS optimization. This approach combines PC algorithm’s strengths in graph structure prediction with the NOTEARS continuous optimization to estimate causal effects accurately. PCnt achieved best aggregate performance across all structural and effect size metrics. Availability and implementationhttps://github.com/zhu-yh1/PC-NOTEARS.
more » « less
Quick and effective approximation of in silico saturation mutagenesis experiments with first-order taylor expansion

https://doi.org/10.1016/j.isci.2024.110807

Sasse, Alexander; Chikina, Maria; Mostafavi, Sara (September 2024, iScience)

Full Text Available
Unlocking gene regulation with sequence-to-function models

https://doi.org/10.1038/s41592-024-02331-5

Sasse, Alexander; Chikina, Maria; Mostafavi, Sara (August 2024, Nature Methods)

Full Text Available
Accelerating NCE Convergence with Adaptive Normalizing Constant Computation

Chikina, Maria; Koes, David; Sevekari, Anish; Aggarwal, Rishal (July 2024, Open Review)

Noise Contrastive Estimation (NCE) is a widely used method for training generative models, typically used as an alternative to Maximum Likelihood Estimation (MLE) when exact computations of probability are hard. NCE trains generative models by discriminating between data and appropriately chosen noise distributions. Although NCE is statistically consistent, it suffers from slow convergence and high variance when there is small overlap between the noise and data distributions. Both these problems are related to the flatness of the NCE loss landscape. We propose an innovative approach to circumvent slow convergence rates by quick inference of the optimal normalizing constant at every gradient step. This allows the rest of the parameters to have more freedom during NCE optimization. We analyze the use of both binary search and the Bennett Acceptance Ratio (BAR) for quick computation of the normalizing constant and show improved performance for both methods on convex and non-convex settings.
more » « less
Full Text Available
InstaPrism: an R package for fast implementation of BayesPrism

https://doi.org/10.1093/bioinformatics/btae440

Hu, Mengying; Chikina, Maria; Nikolski, ed., Macha (July 2024, Bioinformatics)

Abstract SummaryComputational cell-type deconvolution is an important analytic technique for modeling the compositional heterogeneity of bulk gene expression data. A conceptually new Bayesian approach to this problem, BayesPrism, has recently been proposed and has subsequently been shown to be superior in accuracy and robustness against model misspecifications by independent studies; however, given that BayesPrism relies on Gibbs sampling, it is orders of magnitude more computationally expensive than standard approaches. Here, we introduce the InstaPrism package which re-implements BayesPrism in a derandomized framework by replacing the time-consuming Gibbs sampling step with a fixed-point algorithm. We demonstrate that the new algorithm is effectively equivalent to BayesPrism while providing a considerable speed and memory advantage. Furthermore, the InstaPrism package is equipped with a precompiled, curated set of references tailored for a variety of cancer types, streamlining the deconvolution process. Availability and implementationThe package InstaPrism is freely available at: https://github.com/humengying0907/InstaPrism. The source code and evaluation pipeline used in this paper can be found at: https://github.com/humengying0907/InstaPrismSourceCode.
more » « less
RERconverge Expansion: Using Relative Evolutionary Rates to Study Complex Categorical Trait Evolution

https://doi.org/10.1093/molbev/msae210

Redlich, Ruby; Kowalczyk, Amanda; Tene, Michael; Sestili, Heather H; Foley, Kathleen; Saputra, Elysia; Clark, Nathan; Chikina, Maria; Meyer, Wynn K; Pfenning, Andreas R (November 2024, Molecular Biology and Evolution)
Zhou, Xuming (Ed.)
Abstract Comparative genomics approaches seek to associate molecular evolution with the evolution of phenotypes across a phylogeny. Many of these methods lack the ability to analyze non-ordinal categorical traits with more than two categories. To address this limitation, we introduce an expansion to RERconverge that associates shifts in evolutionary rates with the convergent evolution of categorical traits. The categorical RERconverge expansion includes methods for performing categorical ancestral state reconstruction, statistical tests for associating relative evolutionary rates with categorical variables, and a new method for performing phylogeny-aware permutations, “permulations”, on categorical traits. We demonstrate our new method on a three-category diet phenotype, and we compare its performance to binary RERconverge analyses and two existing methods for comparative genomic analyses of categorical traits: phylogenetic simulations and a phylogenetic signal based method. We present an analysis of how the categorical permulations scale with the number of species and the number of categories included in the analysis. Our results show that our new categorical method outperforms phylogenetic simulations at identifying genes and enriched pathways significantly associated with the diet phenotypes and that the categorical ancestral state reconstruction drives an improvement in our ability to capture diet-related enriched pathways compared to binary RERconverge when implemented without user input on phenotype evolution. The categorical expansion to RERconverge will provide a strong foundation for applying the comparative method to categorical traits on larger data sets with more species and more complex trait evolution than have previously been analyzed.
more » « less
Full Text Available

« Prev Next »

Search for: All records